Linear regression (lab 3 for BIOS 26122)

Author

Dmitry Kondrashov

Description

The goal of this assignment is to use linear regression for both single and multiple variables. You will do the following:

  1. Calculate single variable linear regression and assess its quality

  2. Compare errors for training and test set

  3. Use polynomial regression and assess its quality

  4. Check for overfitting and select appropriate model

Heart rates

The following data set contains heart rates measured and reported by students in my class Introduction to Quantitative Modeling for Biology. There are four different heart rate measurements (two at rest and two after exercise), along with the year of measurement.

library(tidyverse)  # provides read_csv, the dplyr verbs, and ggplot2 used below

heart_rates <- read_csv("https://raw.githubusercontent.com/dkon1/intro-ml-bio/main/labs/data/HR_data_combined.csv")
Rows: 1068 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): Rest1, Ex1, Rest2, Ex2, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Select a response and an explanatory variable (from the 4 heart rate variables) and clean the data to remove any outliers or missing values in these variables. Split the data set into training and test sets of equal size.

    heart_data <- heart_rates |> 
      dplyr::select(Rest2, Ex2) |> 
      drop_na() 
    
    
    train_index <- sample(nrow(heart_data), size = floor(0.5 * nrow(heart_data)))
    
    heart_train <- heart_data |>  
      slice(train_index) |> arrange(Rest2)
    
    heart_test <- heart_data |>  
      slice(-train_index) |> arrange(Rest2)
  2. Use linear regression on the training set and print out the summary. Make a scatterplot and overlay the linear regression model as a line on the same plot. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test.

lm_out <- lm(Ex2 ~ Rest2, data = heart_train)

summary(lm_out)

Call:
lm(formula = Ex2 ~ Rest2, data = heart_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.527 -13.675  -2.692  10.316  95.583 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.02093    5.47303   6.764 3.54e-11 ***
Rest2        0.92575    0.07145  12.957  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.71 on 532 degrees of freedom
Multiple R-squared:  0.2399,    Adjusted R-squared:  0.2385 
F-statistic: 167.9 on 1 and 532 DF,  p-value: < 2.2e-16
# base R:
plot(Ex2 ~ Rest2, data = heart_train, cex = .8, col = "blue", main = paste("Linear regression over heart data"))
abline(lm_out)

#produce residual vs. fitted plot
plot(fitted(lm_out), resid(lm_out))

#add a horizontal line at 0 
abline(0,0)

# ggplot:
 heart_train |> ggplot() + 
  aes(x = Rest2, y = Ex2) + geom_point(color = 'blue') +
  geom_smooth(method = 'lm', color = 'darkorange') + ggtitle(paste("Linear regression over heart data"))
`geom_smooth()` using formula = 'y ~ x'

 ggplot(lm_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle(paste("Residuals of regression on heart data"))

The residuals look like a shapeless blob, and the p-values for both the slope and intercept are extremely small, indicating that both are significantly different from zero.

  3. Perform quadratic regression on the training set using lm and print out the summary. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test, and report whether the fit is improved compared to the linear model.

    quad_out <- lm(Ex2 ~ poly(Rest2, degree = 2, raw = TRUE), data = heart_train)
    summary(quad_out)
    
    Call:
    lm(formula = Ex2 ~ poly(Rest2, degree = 2, raw = TRUE), data = heart_train)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -44.453 -14.116  -2.879  10.041  96.495 
    
    Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                          -6.218089  22.628083  -0.275 0.783580    
    poly(Rest2, degree = 2, raw = TRUE)1  2.051981   0.576402   3.560 0.000404 ***
    poly(Rest2, degree = 2, raw = TRUE)2 -0.007154   0.003633  -1.969 0.049473 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 19.65 on 531 degrees of freedom
    Multiple R-squared:  0.2454,    Adjusted R-squared:  0.2426 
    F-statistic: 86.34 on 2 and 531 DF,  p-value: < 2.2e-16
    # base R:
    plot(Ex2 ~ Rest2, data = heart_train, cex = .8, col = "blue", main = paste("Quadratic regression on heart rate data"))
    
    y_pred <- quad_out$coefficients[1] + quad_out$coefficients[2]*heart_train$Rest2 + quad_out$coefficients[3]*heart_train$Rest2^2
    
    lines(heart_train$Rest2, y_pred, lwd = 3, col = 'darkorange')

    #produce residual vs. fitted plot
    plot(fitted(quad_out), resid(quad_out), main = paste("Residuals of quadratic regression on heart data"))
    #add a horizontal line at 0 
    abline(0,0)

    # ggplot:
    heart_train |> ggplot() + 
      aes(x = Rest2, y = Ex2) + geom_point(color = 'blue') +
      geom_smooth( method = 'lm', formula = 'y ~ poly(x, 2, raw=TRUE)', color = 'darkorange') + ggtitle(paste("Quadratic regression"))

    ggplot(quad_out, aes(x = .fitted, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0) +
      ggtitle(paste("Residuals of quadratic regression on heart data"))

    The residuals still look like a shapeless blob, and the R-squared improves slightly (by 1-2% in most cases, depending on the random split). The parameters are usually significant, though the quadratic coefficient is very small.

  4. Use the linear regression model parameters to compute the predicted values for the test set, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. Use the quadratic regression model to compute the predicted values for the test set using the quadratic regression coefficients, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. How did adding the quadratic parameter impact the error for the training and test sets?

    print(paste("Residual variance from linear model for the training set:", var(lm_out$residuals)))
    [1] "Residual variance from linear model for the training set: 387.611676616165"
    y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2]*heart_test$Rest2
    
    print(paste("Residual variance from linear model for the test set:", var(y_pred - heart_test$Ex2))) 
    [1] "Residual variance from linear model for the test set: 369.884552666365"
    print(paste("Residual variance from quadratic model for the training set:", var(quad_out$residuals)))
    [1] "Residual variance from quadratic model for the training set: 384.802144846724"
    y_pred <- quad_out$coefficients[1] + quad_out$coefficients[2]*heart_test$Rest2 + quad_out$coefficients[3]*heart_test$Rest2^2
    
    print(paste("Residual variance from quadratic model for the test set:", var(y_pred - heart_test$Ex2))) 
    [1] "Residual variance from quadratic model for the test set: 360.537551724502"

    Neither the training-set nor the test-set residual variance improves dramatically with the addition of the quadratic term; for most splits the training set improves more, and sometimes the test-set error goes up, indicating potential (though mild) overfitting.
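The hand-coded coefficient arithmetic above can be replaced with `predict()`, which applies any fitted `lm` object to new data without hard-coding the formula. A minimal sketch, assuming `lm_out`, `quad_out`, and `heart_test` exist as defined above:

```r
# predict() evaluates the fitted model on new data, so the same code
# works for the linear and quadratic fits alike
lin_resid  <- heart_test$Ex2 - predict(lm_out, newdata = heart_test)
quad_resid <- heart_test$Ex2 - predict(quad_out, newdata = heart_test)

print(paste("Linear model test residual variance:", var(lin_resid)))
print(paste("Quadratic model test residual variance:", var(quad_resid)))
```

This avoids a common source of bugs (forgetting a term when typing out the polynomial by hand) and scales to any degree.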

  5. Redo the random split of the original data set into training and test sets, leaving progressively smaller fractions of the data in the training set, and calculate linear regression on the training set. Calculate and print the variance of the residuals of the training and test sets, and report whether you observe overfitting.

train_index <- sample(nrow(heart_data), size = floor(0.01 * nrow(heart_data)))

heart_train <- heart_data |>  
  slice(train_index) 

heart_test <- heart_data |>  
  slice(-train_index) 

lm_out <- lm(Ex2 ~ Rest2, data = heart_train)

summary(lm_out)

Call:
lm(formula = Ex2 ~ Rest2, data = heart_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.4794 -10.0917   0.0382  11.2450  15.1063 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  46.0002    36.6270   1.256    0.245
Rest2         0.6896     0.4555   1.514    0.168

Residual standard error: 13.05 on 8 degrees of freedom
Multiple R-squared:  0.2227,    Adjusted R-squared:  0.1255 
F-statistic: 2.292 on 1 and 8 DF,  p-value: 0.1685
print(paste("Residual variance for the 1% training set", var(lm_out$residuals)))
[1] "Residual variance for the 1% training set 151.390262719779"
y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2]*heart_test$Rest2

print(paste("Residual variance for the 99% test set", var(y_pred - heart_test$Ex2))) 
[1] "Residual variance for the 99% test set 386.0053338943"

It really depends on the luck of the draw! For some random selections of training data at 5% or 10%, we observe overfitting (the test-set error is substantially higher than the training-set error), but not always. With a 1% split there is almost always overfitting, though occasionally it goes the other way; this likely has to do with the outliers that were not filtered out.
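The dependence on training-set size can be made systematic by repeating the split at several fractions in a loop. A minimal sketch, assuming `heart_data` is the cleaned data frame from above (results vary run to run since no random seed is fixed):

```r
for (frac in c(0.5, 0.1, 0.05, 0.01)) {
  idx   <- sample(nrow(heart_data), size = floor(frac * nrow(heart_data)))
  train <- heart_data[idx, ]
  test  <- heart_data[-idx, ]
  fit   <- lm(Ex2 ~ Rest2, data = train)
  # compare residual variance on the data the model saw vs. held-out data
  train_var <- var(resid(fit))
  test_var  <- var(test$Ex2 - predict(fit, newdata = test))
  print(paste0("Training fraction ", frac,
               ": train var = ", round(train_var, 1),
               ", test var = ", round(test_var, 1)))
}
```

Wrapping this in a few repetitions (or calling `set.seed()` first) makes the "luck of the draw" effect visible and reproducible.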

Ecological data

The following data set contains observations of the populations of one species of fish (cutthroat trout) and two species of salamander in Mack Creek, Andrews Forest, Willamette National Forest, Oregon. The data set contains 16 variables and over thirty-two thousand observations. The variables include time and date, location, and measurements such as size and weight. The metadata (descriptions of the data) are provided here (click on the “Data entities” tab for explanations of each variable).

mack_data <- read_csv("https://raw.githubusercontent.com/dkon1/quant_life_quarto/main/data/mack_data.csv")
Rows: 32209 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): sitecode, section, reach, unittype, species, clip, notes
dbl  (8): year, pass, unitnum, vert_index, pitnumber, length_1_mm, length_2_...
date (1): sampledate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. You will use the numeric variables length_1_mm and weight_g. Examine the data visually to see if there are any missing values or outliers, clean the data to remove any values you want, and filter the data to contain observations from only one species, either ‘Cutthroat trout’ or ‘Coastal giant salamander’. Split the remaining data into training and test sets of equal size.

    mack_data <- mack_data |> 
      dplyr::select(species, length_1_mm, weight_g) |> 
      drop_na() |> 
      filter(species == 'Cutthroat trout')
    
    train_index <- sample(nrow(mack_data), size = floor(0.5 * nrow(mack_data)))
    
    mack_train <- mack_data |>  
      slice(train_index) |> 
      arrange(length_1_mm)
    
    mack_test <- mack_data |>  
      slice(-train_index) |> 
      arrange(length_1_mm)
  2. Use linear regression on the training set and print out the summary. Make a scatterplot and overlay the linear regression model as a line on the same plot. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test.

lm_out <- lm(weight_g ~ length_1_mm , data = mack_train)

summary(lm_out)

Call:
lm(formula = weight_g ~ length_1_mm, data = mack_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.983  -2.860  -0.265   1.870 102.381 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -12.389375   0.132674  -93.38   <2e-16 ***
length_1_mm   0.256016   0.001473  173.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.12 on 6294 degrees of freedom
Multiple R-squared:  0.8276,    Adjusted R-squared:  0.8276 
F-statistic: 3.021e+04 on 1 and 6294 DF,  p-value: < 2.2e-16
# base R:
plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue", main = paste("Linear regression on mack data"))
abline(lm_out)

#produce residual vs. fitted plot
plot(fitted(lm_out), resid(lm_out), main = "Residuals of linear regression on mack data")

#add a horizontal line at 0 
abline(0,0)

# ggplot:
 mack_train |> ggplot() + 
  aes(x = length_1_mm, y = weight_g) + geom_point(color = 'blue') +
  geom_smooth( method = 'lm', color = 'darkorange') + ggtitle(paste("Linear regression on mack data"))
`geom_smooth()` using formula = 'y ~ x'

 ggplot(lm_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle(paste("Residuals of linear regression on mack data"))

No, the residuals do not look like a shapeless blob! This indicates an essential nonlinearity in the data, so a linear model is not appropriate despite a high R-squared of over 80%. Both parameters are “significant,” as indicated by their low p-values.

  3. Perform quadratic regression on the training set using lm and print out the summary. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test, and report whether the fit is improved compared to the linear model.
quad_out <- lm(weight_g ~ poly(length_1_mm, degree = 2, raw = TRUE), data = mack_train)
summary(quad_out)

Call:
lm(formula = weight_g ~ poly(length_1_mm, degree = 2, raw = TRUE), 
    data = mack_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.025  -0.416  -0.011   0.298 100.063 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                 5.017e+00  1.794e-01   27.96
poly(length_1_mm, degree = 2, raw = TRUE)1 -2.053e-01  4.363e-03  -47.06
poly(length_1_mm, degree = 2, raw = TRUE)2  2.568e-03  2.379e-05  107.92
                                           Pr(>|t|)    
(Intercept)                                  <2e-16 ***
poly(length_1_mm, degree = 2, raw = TRUE)1   <2e-16 ***
poly(length_1_mm, degree = 2, raw = TRUE)2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.44 on 6293 degrees of freedom
Multiple R-squared:  0.9395,    Adjusted R-squared:  0.9395 
F-statistic: 4.887e+04 on 2 and 6293 DF,  p-value: < 2.2e-16
# base R:
plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue", main = paste("Quadratic regression on mack data"))

y_pred <- quad_out$coefficients[1] + quad_out$coefficients[2]*mack_train$length_1_mm + quad_out$coefficients[3]*mack_train$length_1_mm^2

lines(mack_train$length_1_mm, y_pred, lwd = 3, col = 'darkorange')

#produce residual vs. fitted plot
plot(fitted(quad_out), resid(quad_out), main = "Residuals of quadratic regression on mack data")

#add a horizontal line at 0 
abline(0,0)

# ggplot:
 mack_train |> ggplot() + 
  aes(x = length_1_mm, y = weight_g) + geom_point(color = 'blue') +
  geom_smooth( method = 'lm', formula = 'y ~ poly(x, 2, raw=TRUE)', color = 'darkorange') + ggtitle(paste("Quadratic regression on mack data"))

  ggplot(quad_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle(paste("Residuals of quadratic regression on mack data"))

The residuals now look shapeless, and the R-squared improves considerably (by 10% or more). All the parameters are still significant.

  4. Use the linear regression model parameters to compute the predicted values for the test set, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. Use the quadratic regression model to compute the predicted values for the test set using the quadratic regression coefficients, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. How did adding the quadratic parameter impact the error for the training and test sets?
print(paste("Residual variance for the linear model on the training set", var(lm_out$residuals)))
[1] "Residual variance for the linear model on the training set 16.9709086146212"
y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2]*mack_test$length_1_mm

print(paste("Residual variance for the linear model on the test set", var(y_pred - mack_test$weight_g))) 
[1] "Residual variance for the linear model on the test set 15.8478334939268"
print(paste("Residual variance for the quadratic model on the training set", var(quad_out$residuals)))
[1] "Residual variance for the quadratic model on the training set 5.95334225461604"
y_pred <- quad_out$coefficients[1] + quad_out$coefficients[2]*mack_test$length_1_mm + quad_out$coefficients[3]*mack_test$length_1_mm^2

print(paste("Residual variance for quadratic model on the test set", var(y_pred - mack_test$weight_g))) 
[1] "Residual variance for quadratic model on the test set 3.8948956296826"

The error improved dramatically for both the training and test sets with the addition of the quadratic parameter!

  5. Perform a cubic polynomial fit on the same data, print out the summary, and produce the same plots as before. Then compute and print the variance of the residuals for the training and test sets, and compare them. How did adding the cubic parameter impact the error for the training and test sets?

    cub_out <- lm(weight_g ~ poly(length_1_mm, degree = 3, raw = T), data = mack_train)
    summary(cub_out)
    
    Call:
    lm(formula = weight_g ~ poly(length_1_mm, degree = 3, raw = T), 
        data = mack_train)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -63.560  -0.567   0.046   0.476  99.954 
    
    Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)
    (Intercept)                              9.170e+00  3.942e-01   23.26   <2e-16
    poly(length_1_mm, degree = 3, raw = T)1 -3.634e-01  1.407e-02  -25.82   <2e-16
    poly(length_1_mm, degree = 3, raw = T)2  4.303e-03  1.489e-04   28.90   <2e-16
    poly(length_1_mm, degree = 3, raw = T)3 -5.723e-06  4.850e-07  -11.80   <2e-16
    
    (Intercept)                             ***
    poly(length_1_mm, degree = 3, raw = T)1 ***
    poly(length_1_mm, degree = 3, raw = T)2 ***
    poly(length_1_mm, degree = 3, raw = T)3 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2.414 on 6292 degrees of freedom
    Multiple R-squared:  0.9408,    Adjusted R-squared:  0.9408 
    F-statistic: 3.335e+04 on 3 and 6292 DF,  p-value: < 2.2e-16
    # base R:
    plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue", main = paste("Cubic regression on mack data"))
    
    y_pred <- cub_out$coefficients[1] + cub_out$coefficients[2]*mack_train$length_1_mm + cub_out$coefficients[3]*mack_train$length_1_mm^2 + cub_out$coefficients[4]*mack_train$length_1_mm^3
    
    lines(mack_train$length_1_mm, y_pred, lwd = 3, col = 'darkorange')

    #produce residual vs. fitted plot
    plot(fitted(cub_out), resid(cub_out), main = "Residuals of cubic regression on mack data")
    
    #add a horizontal line at 0 
    abline(0,0)

    # ggplot:
     mack_train |> ggplot() + 
      aes(x = length_1_mm, y = weight_g) + geom_point(color = 'blue') +
      geom_smooth( method = 'lm', formula = 'y ~ poly(x, degree = 3, raw = TRUE)', color = 'darkorange') + ggtitle(paste("Cubic regression on mack data"))

     cub_out |>  ggplot() + aes(x = .fitted, y = .resid) +
      geom_point() +
      geom_hline(yintercept = 0) +
      ggtitle(paste("Residuals of cubic regression on mack data"))

    print(paste("Residual variance for the cubic model on the training set", var(cub_out$residuals)))
    [1] "Residual variance for the cubic model on the training set 5.82443895879213"
    y_pred <- cub_out$coefficients[1] + cub_out$coefficients[2]*mack_test$length_1_mm + cub_out$coefficients[3]*mack_test$length_1_mm^2 + cub_out$coefficients[4]*mack_test$length_1_mm^3
    
    print(paste("Residual variance for cubic model on the test set", var(y_pred - mack_test$weight_g))) 
    [1] "Residual variance for cubic model on the test set 4.18022557773365"

Adding a cubic term improves the error for the training set by a small amount, but the error for the test set actually gets worse (about 4.18 versus 3.89 for the quadratic model). This is a modest but clear example of overfitting.
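The degree-by-degree comparison above can be automated by scanning several polynomial degrees and tracking the test-set residual variance, which typically flattens or starts rising once overfitting sets in. A sketch under the same assumptions (`mack_train` and `mack_test` as defined above):

```r
for (deg in 1:5) {
  fit <- lm(weight_g ~ poly(length_1_mm, degree = deg, raw = TRUE),
            data = mack_train)
  # training error: variance of in-sample residuals;
  # test error: variance of residuals on held-out data via predict()
  test_var <- var(mack_test$weight_g - predict(fit, newdata = mack_test))
  print(paste0("Degree ", deg,
               ": train var = ", round(var(resid(fit)), 2),
               ", test var = ", round(test_var, 2)))
}
```

Choosing the degree where the test-set variance bottoms out is a simple form of model selection consistent with the reasoning in this lab.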